Results 1 - 11 of 11
1.
Nat Plants; 10(3): 390-401, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38467801

ABSTRACT

Scientific testing, including stable isotope ratio analysis (SIRA) and trace element analysis (TEA), is critical for establishing plant origin, tackling deforestation and enforcing economic sanctions. Yet methods combining SIRA and TEA into robust models for origin verification and determination are lacking. Here we report (1) a large Eastern European timber reference database (Betula, Fagus, Pinus, Quercus) tailored to sanctioned products following the Ukraine invasion; (2) a statistical test to verify samples against a claimed origin; (3) a probabilistic model of SIRA, TEA and genus distribution data, using Gaussian processes, to determine timber harvest location. Our verification method rejects 40-60% of simulated false claims, depending on the spatial scale of the claim, and maintains a low probability of rejecting correct origin claims. Our determination method predicts harvest location within 180 to 230 km of true location. Our results showcase the power of combining data types with probabilistic modelling to identify and scrutinize timber harvest location claims.


Subjects
Fagus, Pinus, Ukraine, Betula, Plant Genes
2.
Syst Biol; 72(5): 1199-1206, 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37498209

ABSTRACT

Bayesian phylogenetics is now facing a critical point. Over the last 20 years, Bayesian methods have reshaped phylogenetic inference and gained widespread popularity due to their high accuracy, the ability to quantify the uncertainty of inferences and the possibility of accommodating multiple aspects of evolutionary processes in the models that are used. Unfortunately, Bayesian methods are computationally expensive, and typical applications involve at most a few hundred sequences. This is problematic in the age of rapidly expanding genomic data and increasing scope of evolutionary analyses, forcing researchers to resort to less accurate but faster methods, such as maximum parsimony and maximum likelihood. Does this spell doom for Bayesian methods? Not necessarily. Here, we discuss some recently proposed approaches that could help scale up Bayesian analyses of evolutionary problems considerably. We focus on two particular aspects: online phylogenetics, where new data sequences are added to existing analyses, and alternatives to Markov chain Monte Carlo (MCMC) for scalable Bayesian inference. We identify 5 specific challenges and discuss how they might be overcome. We believe that online phylogenetic approaches and Sequential Monte Carlo hold great promise and could potentially speed up tree inference by orders of magnitude. We call for collaborative efforts to speed up the development of methods for real-time tree expansion through online phylogenetics.


Subjects
Biological Evolution, Genetic Models, Phylogeny, Bayes Theorem, Monte Carlo Method, Markov Chains
3.
BMC Bioinformatics; 22(1): 285, 2021 May 28.
Article in English | MEDLINE | ID: mdl-34049487

ABSTRACT

BACKGROUND: Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are 'novel' compared to the others in the same dataset, and low weights to sequences that are over-represented. RESULTS: We formalise this principle by rigorously defining the evolutionary 'novelty' of a sequence within an alignment. This results in new sequence weights that we call 'phylogenetic novelty scores'. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column, which is important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes. CONCLUSIONS: Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.
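
As a rough illustration of the example application mentioned above (estimating character frequencies at an alignment column from weighted sequences), the sketch below computes weighted frequencies for a single column. The function name `weighted_frequencies` and the weights in the usage note are hypothetical; the weights are arbitrary placeholders and are not phylogenetic novelty scores, whose definition is not given in the abstract.

```python
from collections import defaultdict


def weighted_frequencies(column, weights):
    """Estimate character frequencies at one alignment column.

    column  -- one character per sequence (gaps '-' are ignored)
    weights -- one non-negative weight per sequence (hypothetical values;
               any sequence weighting scheme could supply them)
    """
    if len(column) != len(weights):
        raise ValueError("column and weights must have the same length")

    totals = defaultdict(float)
    for char, weight in zip(column, weights):
        if char == "-":  # skip gap characters
            continue
        totals[char] += weight

    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("no weighted, non-gap characters in this column")

    # Normalise so that the frequencies sum to one.
    return {char: w / total_weight for char, w in totals.items()}
```

With equal weights, `weighted_frequencies("AAAC", [1, 1, 1, 1])` gives A a frequency of 0.75. Up-weighting the last sequence, as a scheme might do for an under-represented taxon, with `[1, 1, 1, 3]` gives A and C a frequency of 0.5 each.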


Subjects
Algorithms, Computational Biology, Phylogeny, Sequence Alignment
4.
Theor Popul Biol; 137: 22-31, 2021 Feb.
Article in English | MEDLINE | ID: mdl-33333117

ABSTRACT

The multispecies coalescent process models the genealogical relationships of genes sampled from several species, enabling useful predictions about phenomena such as the discordance between a gene tree and the species phylogeny due to incomplete lineage sorting. Conversely, knowledge of large collections of gene trees can inform us about several aspects of the species phylogeny, such as its topology and ancestral population sizes. A fundamental open problem in this context is how to efficiently compute the probability of a gene tree topology, given the species phylogeny. Although a number of algorithms for this task have been proposed, they either produce approximate results, or, when they are exact, they do not scale to large data sets. In this paper, we present some progress towards exact and efficient computation of the probability of a gene tree topology. We provide a new algorithm that, given a species tree and the number of genes sampled for each species, calculates the probability that the gene tree topology will be concordant with the species tree. Moreover, we provide an algorithm that computes the probability of any specific gene tree topology concordant with the species tree. Both algorithms run in polynomial time and have been implemented in Python. Experiments show that they are able to analyze data sets where thousands of genes are sampled in a matter of minutes to hours.
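
For orientation, the smallest instance of this problem has a well-known closed form: for a species tree ((A,B),C) with one gene sampled per species and an internal branch of length T in coalescent units, the gene tree topology is concordant with probability 1 - (2/3)e^(-T). The sketch below evaluates that textbook formula only. It is not the polynomial-time algorithm described in the abstract, and the function name is hypothetical.

```python
import math


def concordance_probability_three_taxa(internal_branch_length):
    """Probability that the gene tree topology matches the species tree
    ((A,B),C) when one gene is sampled per species.

    internal_branch_length -- length T of the branch ancestral to A and B,
                              in coalescent units (must be >= 0)
    """
    if internal_branch_length < 0:
        raise ValueError("branch length must be non-negative")

    # With probability exp(-T) the A and B lineages fail to coalesce on the
    # internal branch; the three lineages then coalesce in the root
    # population, where each of the three topologies is equally likely.
    p_no_coalescence = math.exp(-internal_branch_length)
    return 1.0 - (2.0 / 3.0) * p_no_coalescence
```

At T = 0 the function returns 1/3, since all three topologies are equally likely, and the value approaches 1 as T grows.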


Subjects
Algorithms, Genetic Models, Genetic Speciation, Phylogeny, Probability
5.
J Math Biol; 79(2): 485-508, 2019 Jul.
Article in English | MEDLINE | ID: mdl-31037350

ABSTRACT

The transfer distance (TD) was introduced in the classification framework and studied in the context of phylogenetic tree matching. Recently, Lemoine et al. (Nature 556(7702):452-456, 2018. https://doi.org/10.1038/s41586-018-0043-0) showed that TD can be a powerful tool to assess the branch support on large phylogenies, thus providing a relevant alternative to Felsenstein's bootstrap. This distance allows a reference branch [Formula: see text] in a reference tree [Formula: see text] to be compared to a branch b from another tree T (typically a bootstrap tree), both on the same set of n taxa. The TD between these branches is the number of taxa that must be transferred from one side of b to the other in order to obtain [Formula: see text]. By taking the minimum TD from [Formula: see text] to all branches in T we define the transfer index, denoted by [Formula: see text], measuring the degree of agreement of T with [Formula: see text]. Let us consider a reference branch [Formula: see text] having p tips on its light side and define the transfer support (TS) as [Formula: see text]. Lemoine et al. (2018) used computer simulations to show that the TS defined in this manner is close to 0 for random "bootstrap" trees. In this paper, we demonstrate that result mathematically: when T is randomly drawn, TS converges in probability to 0 when n tends to infinity. Moreover, we fully characterize the distribution of [Formula: see text] on caterpillar trees, indicating that the convergence is fast, and that even when n is small, moderate levels of branch support cannot appear by chance.
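
The definitions above can be sketched in a few lines, representing each branch by the set of taxa on one of its sides. This is a minimal illustration, not the authors' code. All function names are hypothetical, and the support formula (one minus the transfer index divided by p - 1) is an assumption taken from Lemoine et al. (2018), because the formula itself is missing from this record.

```python
def transfer_distance(side_a, side_b, taxa):
    """Number of taxa to move across branch b to turn it into branch a.

    Each branch is given as the set of taxa on one of its sides; since
    either side may be given, we take the smaller of the two mismatches.
    """
    mismatch = len(frozenset(side_a) ^ frozenset(side_b))
    return min(mismatch, len(taxa) - mismatch)


def transfer_index(ref_side, tree_branches, taxa):
    """Minimum transfer distance from the reference branch to any branch of T."""
    return min(transfer_distance(ref_side, side, taxa) for side in tree_branches)


def transfer_support(ref_side, tree_branches, taxa):
    """Assumed normalisation: 1 - index / (p - 1), with p the light-side size."""
    n = len(taxa)
    p = min(len(ref_side), n - len(ref_side))
    if p < 2:
        raise ValueError("support is only defined for internal branches (p >= 2)")
    # A terminal branch of T on a light-side taxon is always at distance
    # p - 1, so the index never exceeds p - 1 even if only internal
    # branches are supplied.
    index = min(p - 1, transfer_index(ref_side, tree_branches, taxa))
    return 1.0 - index / (p - 1)
```

For taxa A to F, a reference branch separating {A, B, C} from the rest, and a tree T with branches {A, B}, {A, B, D} and {E, F}, the transfer distances are 1, 2 and 1, so the transfer index is 1 and the support under the assumed formula is 0.5.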


Subjects
Horizontal Gene Transfer, Genetic Models, Phylogeny, Algorithms, Computer Simulation
6.
Syst Biol; 65(2): 328-33, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26615177

ABSTRACT

We prove that maximum likelihood phylogenetic inference is consistent on gapped multiple sequence alignments (MSAs) as long as substitution rates across each edge are greater than zero, under mild assumptions on the structure of the alignment. Under these assumptions, maximum likelihood will asymptotically recover the tree with edge lengths corresponding to the mean number of substitutions per site on each edge. This refutes Warnow's recent suggestion (Warnow 2012) that maximum likelihood phylogenetic inference might be statistically inconsistent when gaps are treated as missing data, even if the MSA is correct. We also derive a simple new proof of maximum likelihood consistency of ungapped alignments.


Subjects
Classification/methods, Computer Simulation, Phylogeny, Sequence Alignment
7.
Pac Symp Biocomput; : 310-9, 2013.
Article in English | MEDLINE | ID: mdl-23424136

ABSTRACT

We consider the problem of phylogenetic placement, in which large numbers of sequences (often next-generation sequencing reads) are placed onto an existing phylogenetic tree. We adapt our recent work on phylogenetic tree inference, which uses ancestral sequence reconstruction and locality-sensitive hashing, to this domain. With these ideas, new sequences can be placed onto trees with high fidelity in strikingly fast runtimes. Our results are two orders of magnitude faster than existing programs for this domain, and show a modest accuracy tradeoff. Our results offer the possibility of analyzing many more reads in a next-generation sequencing project than is currently possible.


Subjects
Algorithms, Phylogeny, Computational Biology, Nucleic Acid Databases/statistics & numerical data, Molecular Evolution, Humans, Metagenomics/statistics & numerical data, Microbiota/genetics, Sequence Alignment/statistics & numerical data, Software
8.
Algorithms Mol Biol; 7(1): 32, 2012 Nov 26.
Article in English | MEDLINE | ID: mdl-23181935

ABSTRACT

Recently, we have identified a randomized quartet phylogeny algorithm that has O(n log n) runtime with high probability, which is asymptotically optimal. Our algorithm has high probability of returning the correct phylogeny when quartet errors are independent and occur with known probability, and when the algorithm uses a guide tree on O(log log n) taxa that is correct with high probability. In practice, none of these assumptions is correct: quartet errors are positively correlated and occur with unknown probability, and the guide tree is often error prone. Here, we bring our work out of the purely theoretical setting. We present a variety of extensions which, while only slowing the algorithm down by a constant factor, make its performance nearly comparable to that of Neighbour Joining, which requires Θ(n^3) runtime in existing implementations. Our results suggest a new direction for quartet-based phylogenetic reconstruction that may yield striking speed improvements at minimal accuracy cost. An early prototype implementation of our software is available at http://www.cs.uwaterloo.ca/jmtruszk/qtree.tar.gz.

9.
BMC Bioinformatics; 13: 31, 2012 Feb 14.
Article in English | MEDLINE | ID: mdl-22333067

ABSTRACT

BACKGROUND: Illumina paired-end reads are used to analyse microbial communities by targeting amplicons of the 16S rRNA gene. Publicly available tools are needed to assemble overlapping paired-end reads while correcting mismatches and uncalled bases; many errors could be corrected to obtain higher sequence yields using quality information. RESULTS: PANDAseq assembles paired-end reads rapidly and with the correction of most errors. Uncertain error corrections come from reads with many low-quality bases identified by upstream processing. Benchmarks were done using real error masks on simulated data, a pure source template, and a pooled template of genomic DNA from known organisms. PANDAseq assembled reads more rapidly and with reduced error incorporation compared to alternative methods. CONCLUSIONS: PANDAseq rapidly assembles sequences and scales to billions of paired-end reads. Assembly of control libraries showed a 4-50% increase in the number of assembled sequences over naïve assembly with negligible loss of "good" sequence.


Subjects
Bacteria/isolation & purification, Metagenomics, Software, Bacteria/genetics, Bacterial RNA/genetics, 16S Ribosomal RNA/genetics
10.
BMC Bioinformatics; 12: 168, 2011 May 17.
Article in English | MEDLINE | ID: mdl-21586147

ABSTRACT

BACKGROUND: Identifying recombinations in HIV is important for studying the epidemiology of the virus and aids in the design of potential vaccines and treatments. The previous widely-used tool for this task uses the Viterbi algorithm in a hidden Markov model to model recombinant sequences. RESULTS: We apply a new decoding algorithm for this HMM that improves prediction accuracy. Exactly locating breakpoints is usually impossible, since different subtypes are highly conserved in some sequence regions. Our algorithm identifies these sites up to a certain error tolerance. Our new algorithm is more accurate in predicting the location of recombination breakpoints. Our implementation of the algorithm is available at http://www.cs.uwaterloo.ca/~jmtruszk/jphmm_balls.tar.gz. CONCLUSIONS: By explicitly accounting for uncertainty in breakpoint positions, our algorithm offers more reliable predictions of recombination breakpoints in HIV-1. We also document a new domain of use for our new decoding approach in HMMs.
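
For readers unfamiliar with the baseline mentioned in the background, the sketch below shows standard Viterbi decoding on a toy two-state HMM. It illustrates the conventional approach only, not the new decoding algorithm of this paper or the model used by any existing tool. The states, symbols and probabilities are invented for illustration, and all probabilities are assumed to be strictly positive.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the single most probable state path (log-space Viterbi).

    Assumes every start, transition and emission probability is > 0.
    """
    if not observations:
        return []

    # Best log-probability of any path ending in each state so far.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []

    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position.
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][obs]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model: two hypothetical "subtypes" A and B emitting symbols x and y.
states = ["A", "B"]
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
decoded = "".join(viterbi("xxxyyy", states, start_p, trans_p, emit_p))
```

Here `decoded` is `"AAABBB"`, which places a single breakpoint between the third and fourth positions. Viterbi commits to one exact breakpoint position, whereas the abstract describes identifying breakpoints up to a certain error tolerance.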


Subjects
Algorithms, HIV-1/genetics, Markov Chains, Genetic Recombination, Viral Genome, Humans
11.
BMC Bioinformatics; 11 Suppl 1: S40, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122214

ABSTRACT

BACKGROUND: Existing hidden Markov model decoding algorithms do not focus on approximately identifying the sequence feature boundaries. RESULTS: We give a set of algorithms to compute the conditional probability of all labellings "near" a reference labelling lambda for a sequence y for a variety of definitions of "near". In addition, we give optimization algorithms to find the best labelling for a sequence in the robust sense of having all of its feature boundaries nearly correct. Natural problems in this domain are NP-hard to optimize. For membrane proteins, our algorithms find the approximate topology of such proteins with comparable success to existing programs, while being substantially more accurate in estimating the positions of transmembrane helix boundaries. CONCLUSION: More robust HMM decoding may allow for better analysis of sequence features, in reasonable runtimes.


Subjects
Algorithms, Markov Chains, Membrane Proteins/chemistry, Protein Databases